ALPHA-VISION® AvET: Regular Expressions
Regular Expressions
SUMMARY
A regular expression (RE) serves as a template, or mask, against which the
text to be searched is compared.
CONTENTS
- Introduction
- Literal, Meta Character, Anchor
- Meta characters that are supported
- Anchors that are supported
Introduction
A regular expression (RE) serves as a template, or mask, against which the
text to be searched is compared. The RE is interpreted and tested to see
whether it matches at any position of the searched text.
Some characters play a special role; they are called meta characters.
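The "matches at any position" behaviour can be sketched as follows. This is only an illustration: it uses Python's standard `re` module as a stand-in and assumes its semantics are comparable for these simple patterns; the helper name `found_anywhere` is hypothetical and not part of the product.

```python
import re

def found_anywhere(pattern: str, text: str) -> bool:
    """Return True if the pattern matches at any position of the text."""
    # re.search scans the text from left to right and stops at the first match.
    return re.search(pattern, text) is not None

print(found_anywhere("ab*c", "dabbbbcd"))  # True: "abbbbc" is found inside the text
print(found_anywhere("ab*c", "xyz"))       # False: no position matches
```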
Literal, Meta Character, Anchor
Literals are the letters A-Z, a-z and the digits 0-9, as well as most of the
special characters; literals can be used one-to-one in an RE.
Meta characters are characters that are not searched for themselves but, in
combination with others, describe the string that is searched for. By
combining meta characters you get a powerful tool for finding several search
terms with a single RE.
Literals and meta characters relate to one or more characters, whereas an
anchor represents a position; it does not correspond to a character of the
text to be searched.
Meta characters that are supported
- *
The asterisk after an RE part allows that part to occur any number of
times, including not at all, at the current position.
e.g. the RE "ab*" recognizes the strings a, ab, abbbb etc.
- +
The "+" sign works like the asterisk but requires the RE part to appear at
least once.
e.g. the RE "ab+" recognizes the strings ab, abb, abbbb etc., but not
the string a.
- ?
The "?" after an RE part allows that part to appear once or not at all in
the text to be searched.
e.g. the RE "ab?" recognizes the strings a and ab, but nothing else.
- [
Characters enclosed in square brackets define a “character class”.
A character is matched if it appears in the defined set. Exactly one
character from the alternatives must occur at the current position.
The alternatives are simply listed in the RE: [abcd01234] stands
for the letters “a” to “d” and the digits “0” to “4”. Ranges are
permitted, i.e. [a-d0-4] is valid too.
The opening square bracket “[” starts a new language level that has
its own meta characters.
The only meta characters used there are “^”, “]” and “-”. With the meta
character “^” (only in the first position) you get the complement of
the set: [^0-9] denotes “the next character is not a digit”.
WARNING
The interpretation “no digit may follow” is wrong, since a character
from the complement set is required; there has to be a character at that
position.
- .
The period “.” is a special class. It stands for any ONE
character. The RE “s.art” will find ‘smart’, ‘swart’, ‘start’, but not
‘sart’ or ‘sitart’.
- ( and )
Round brackets define the evaluation order of the RE parts. The brackets
themselves are not searched for;
e.g. the RE a(b*(c?))d recognizes the strings abbcd, abcd, acd, abbd,
abd, ad etc.
- |
An alternation is written as (reg1|reg2|...|regn). The alternatives are
tested from left to right; the alternation succeeds as soon as one
regular subexpression matches.
Note:
Alternations are not very efficient. If the alternatives are single
characters, it is preferable to use a character class:
Write "[Ai]st" instead of "(A|i)st".
e.g. the RE "(bc|d?e|a*f)" recognizes the strings bc, de, e, f, af, aaf etc.
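The examples given for the meta characters above can be checked with a short script. This is a sketch under stated assumptions: it uses Python's standard `re` module as a stand-in and assumes its behaviour agrees for these simple patterns. "Recognizes the string" is read here as "the pattern describes the whole string", which is why `re.fullmatch` is used. The names `recognizes`, `EXAMPLES` and `check_examples` are hypothetical helpers, and the rejected strings marked as added are not taken from the text.

```python
import re

def recognizes(pattern: str, text: str) -> bool:
    """Return True if the pattern describes the whole string."""
    return re.fullmatch(pattern, text) is not None

# pattern -> (strings that should be recognized, strings that should not)
EXAMPLES = {
    "ab*": (["a", "ab", "abbbb"], []),
    "ab+": (["ab", "abb", "abbbb"], ["a"]),
    "ab?": (["a", "ab"], ["abb"]),              # "abb" added for illustration
    "[a-d0-4]": (["a", "d", "0", "4"], ["e", "5"]),  # rejects added for illustration
    "[^0-9]": (["x"], ["7", ""]),               # the empty string fails: a character is required
    "s.art": (["smart", "swart", "start"], ["sart", "sitart"]),
    "a(b*(c?))d": (["abbcd", "abcd", "acd", "abbd", "abd", "ad"], []),
    "(bc|d?e|a*f)": (["bc", "de", "e", "f", "af", "aaf"], []),
    "[Ai]st": (["Ast", "ist"], ["st"]),         # character class instead of (A|i)st
}

def check_examples(examples=EXAMPLES):
    """Return the (pattern, text) pairs whose result differs from the expectation."""
    mismatches = []
    for pattern, (accepted, rejected) in examples.items():
        for text in accepted:
            if not recognizes(pattern, text):
                mismatches.append((pattern, text))
        for text in rejected:
            if recognizes(pattern, text):
                mismatches.append((pattern, text))
    return mismatches

print(check_examples())  # an empty list means every example behaved as described
```

The "[^0-9]" entry illustrates the warning: the empty string is rejected because the class still demands one character at that position.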
Anchors that are supported
- ^
The “^” makes sense only at the first position of an RE; it requires the
match to start at the first character of the text; i.e. the string dabbbbcd
is recognized by the RE “ab*c” but not by the RE “^ab*c”.
- $
Analogous to “^”, the “$” requires that the match end at the last
character of the text; i.e. the string abbbbcd is recognized by the RE
"ab*c" but not by the RE "ab*c$".
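The effect of the two anchors can be made visible by printing where a match starts and ends. Again this is only a sketch that uses Python's standard `re` module as a stand-in and assumes comparable behaviour; `match_span` is a hypothetical helper name.

```python
import re

def match_span(pattern: str, text: str):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.span() if m else None

print(match_span("ab*c", "dabbbbcd"))   # (1, 7): found in the middle of the text
print(match_span("^ab*c", "dabbbbcd"))  # None: the text does not start with "a"
print(match_span("ab*c", "abbbbcd"))    # (0, 6): found at the start of the text
print(match_span("ab*c$", "abbbbcd"))   # None: the text does not end with "c"
print(match_span("^", "abc"))           # (0, 0): an anchor is a position, not a character
```

The last line shows that an anchor on its own matches with an empty span, in line with the statement that an anchor does not correspond to a character of the text. Note that in Python "$" also matches just before a trailing newline; whether this applies elsewhere is not stated in the text.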